Abstract



Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a 
randomized algorithm based on Bayesian ideas, and has recently generated significant interest
after several studies demonstrated it to have better empirical performance compared to the state 
of the art methods. However, many questions regarding its theoretical performance remained 
open. In this paper, we design and analyze a Thompson Sampling algorithm for the contextual
multi-armed bandit problem with linear payoff functions, when the contexts are provided by 
an adaptive adversary. This is perhaps the most important and widely studied version of the 
contextual bandits problem. We prove a high probability regret bound of $\tilde{O}\left(\frac{d}{\sqrt{\epsilon}}\sqrt{T^{1+\epsilon}}\right)$ in time
T for any $0 < \epsilon < 1$, where d is the dimension of each context vector and $\epsilon$ is a parameter used
by the algorithm. Our results provide the first theoretical guarantees for the contextual version
of Thompson Sampling, and are close to the lower bound of $\Omega(\sqrt{Td})$ for this problem. This
essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret 
bounds for Thompson Sampling for contextual bandits problem. 

Our version of Thompson sampling uses Gaussian prior and Gaussian likelihood function. 
Our novel martingale-based analysis techniques also allow easy extensions to the use of more 
general distributions, satisfying certain general conditions. 







1 Introduction 



Multi-armed bandit (MAB) problems model the exploration/exploitation trade-off inherent in many 
sequential decision problems. There are many versions of multi-armed bandit problems; a particularly
useful version is the contextual multi-armed bandit problem. In this problem, on each of T
rounds, a learner is presented with the choice of taking one of N actions, referred to as N arms.
Before making the choice of which arm to play, the learner sees a d-dimensional feature vector
$b_i$, referred to as "context", associated with each arm i. The learner uses these feature vectors
along with the feature vectors and rewards of arms played by her in the past to make the choice
of arm. Over time, the learner's aim is to gather enough information about how the feature vectors
and rewards are related to each other, so that she can predict, with some certainty, which arm 
will give the best reward by looking at the feature vectors. The learner competes with a class of 
predictors, in which each predictor takes in the feature vectors and predicts which arm will give 
the best reward. If the learner can guarantee to do nearly as well as the predictions of the best 
predictor in hindsight (to have low regret), the learner is said to successfully compete with that 
class. 

In the contextual bandits setting with linear payoff functions, the learner competes with the class 
of all "linear" predictors on the feature vectors. That is, a predictor is defined by N d-dimensional
parameters $\{\bar{\mu}_i\}_{i=1}^N$, and the predictor ranks the arms according to $b_i^T \bar{\mu}_i$. We consider the contextual
bandit problem under the linear realizability assumption, that is, we assume that there are unknown
underlying parameters $\{\mu_i\}_{i=1}^N$ such that the expected reward for each arm i, given context $b_i$, is
$b_i^T \mu_i$. Under this realizability assumption, the linear predictor corresponding to $\{\mu_i\}_{i=1}^N$ is in fact
the best predictor and the learner's aim is to learn these underlying parameters. This realizability 
assumption is standard in the existing literature on contextual multi-armed bandits [H [TT| [9j Q] . 

In this paper, we analyze the Thompson Sampling (TS) algorithm for the contextual bandits problem
with linear payoffs. Thompson Sampling is one of the earliest heuristics for multi-armed
bandit problems. The first version of this Bayesian heuristic is around 80 years old, dating to
Thompson (1933) [25]. Since then, it has been rediscovered numerous times independently in the context
of reinforcement learning, e.g., in [27, 20, 24]. It is a member of the family of randomized
probability matching algorithms. The basic idea is to assume a simple prior distribution on the 
underlying parameters of the reward distribution of every arm, and at every time step, play an arm 
according to its posterior probability of being the best arm. The general structure of Thompson 
sampling for the contextual bandits problem involves the following elements: 

1. a set of parameters $\mu$;

2. an assumed prior distribution $P(\mu)$ on these parameters;

3. past observations $\mathcal{D}$ consisting of (context b, reward r) for the past time steps;

4. an assumed likelihood function $P(r \mid b, \mu)$, which gives the probability of reward given a context
b and a parameter $\mu$;

5. a posterior distribution $P(\mu \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu) P(\mu)$, where $P(\mathcal{D} \mid \mu)$ is the likelihood function.

In each round, TS plays an arm according to its posterior probability of maximizing the expected 
reward. A simple way to achieve that is to produce a sample of reward for each arm, using the 
posterior distributions, and play the arm that produces the largest sample. We emphasize that 
although TS algorithm is a Bayesian approach, the description of the algorithm and our analysis 
apply to the prior-free stochastic MAB model, and are directly comparable to the UCB family of 
algorithms which are a frequentist approach to the same problem. One could interpret the Bayesian
priors used by the TS algorithm as a way of capturing the current knowledge about the arms. 

Recently, TS has attracted considerable attention. Several studies (e.g., [13, 22, 12, 7, 19, 15])
have empirically demonstrated the efficacy of TS: Scott [22] provides a detailed discussion of probability
matching techniques in many general settings along with favorable empirical comparisons
with other techniques. Chapelle and Li [7] demonstrate that for the basic stochastic MAB problem,
empirically TS achieves regret comparable to the lower bound of [16]; and in applications like
display advertising and news article recommendation modeled by the contextual bandits problem,
it is competitive with or better than other methods such as UCB. In their experiments, TS is also
more robust to delayed or batched feedback than the other methods. TS has also been used in an 
industrial scale application for CTR prediction of search ads on search engines [12] . Kaufmann et 
al. [15] do a thorough comparison of TS with the best known versions of UCB, and show that TS 
has the lowest regret in the long run. 

Despite being easy to implement and competitive with state of the art methods, the
theoretical understanding of the TS algorithm is limited. [13, 18] provided weak guarantees, namely,
a bound of o(T) on expected regret in time T. More recently, some significant progress was made
by [3, 15], who provided near-optimal problem-dependent bounds on the expected regret of TS for
the basic (i.e. without contexts) version of the stochastic MAB problem. However, many questions 
regarding theoretical analysis of TS remained open, including near-optimal problem-independent 
regret bounds, high probability regret bounds, and regret bounds for the more general contextual 
bandits setting. Some of these questions were formally raised as a COLT 2012 open problem [8]. In 
this paper, we use novel and simple martingale-based analysis techniques to demonstrate that TS 
achieves high probability, near-optimal problem independent regret bounds for contextual bandits 
with linear payoffs. To our knowledge, ours are the first non-trivial regret bounds for TS for the 
contextual bandits problem. Additionally, our results are the first high probability regret bounds 
for TS, even in the case of basic MAB problem. This essentially solves the COLT 2012 open 
problem [8] for linear contextual bandits. 

The contextual MAB problem does not seem easily amenable to the techniques used so far for
analyzing the basic MAB problem by [3, 15]. In Section 2.3 we describe some of the challenges, and
our martingale-based solution ideas to handle them. Our version of the Thompson Sampling algorithm,
described formally in Section 2.2, uses Gaussian prior and Gaussian likelihood functions. As we
discuss towards the end of the paper in Section 4, our techniques are easily extensible to the use of
other prior distributions, satisfying certain conditions.

1.1 Our Results 

The formal problem statement appears in Section 2.1.

Theorem 1. For the contextual bandit problem with linear payoffs, with probability $1 - \delta$, the
total regret in time T for Thompson Sampling is bounded by $O\left(d\sqrt{\frac{T^{1+\epsilon}}{\epsilon} N}\,\ln T \ln\frac{1}{\delta}\right)$, for any
$0 < \epsilon < 1$. Here, $\epsilon$ is a parameter used by the Thompson Sampling algorithm.

Theorem 2. When $\mu_1 = \mu_2 = \cdots = \mu_N = \mu$, i.e., there is a single underlying d-dimensional parameter
$\mu$, then with probability $1 - \delta$, the total regret in time T for Thompson Sampling is bounded by

Remark 1. Here $0 < \epsilon < 1$ can be chosen to be any constant. If T is known, one could choose
$\epsilon = \frac{1}{\ln T}$, to get an $\tilde{O}(d\sqrt{NT})$ (and $\tilde{O}(d\sqrt{T})$) regret bound.
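
The effect of this choice of $\epsilon$ can be checked with a short worked computation. It assumes only that the bounds depend on T and $\epsilon$ through the factor $\sqrt{T^{1+\epsilon}/\epsilon}$, and it ignores the logarithmic factors hidden by the $\tilde{O}$ notation.

```latex
\[
T^{\epsilon}\Big|_{\epsilon = 1/\ln T} = e^{\epsilon \ln T} = e,
\qquad\text{so}\qquad
\sqrt{\frac{T^{1+\epsilon}}{\epsilon}}\;\Bigg|_{\epsilon = 1/\ln T}
= \sqrt{T \cdot e \cdot \ln T}
= \sqrt{e\,T \ln T}.
\]
```

The polynomial overhead $T^{\epsilon/2}$ therefore collapses to the constant $\sqrt{e}$, at the price of one extra $\sqrt{\ln T}$ factor coming from $1/\sqrt{\epsilon}$.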

Remark 2. Note that Theorem 2 has only a logarithmic dependence on the number of arms N,
which makes it particularly useful when the number of arms N is very large, but there is a single
underlying d-dimensional parameter $\mu$, with d being much smaller than N. One could also recover
the setting with different $\mu_i$s from the setting with a single $\mu$, by letting $\mu$ be an Nd-dimensional
vector formed by appending all the $\mu_i$s, and letting context $b_i(t)$ be an Nd-dimensional vector which
is 0 in all but the d positions corresponding to arm i. However, a direct application of Theorem 2 would
then give a slightly weaker bound of $\tilde{O}\left(\frac{1}{\sqrt{\epsilon}} Nd\sqrt{T^{1+\epsilon}}\right)$ compared to Theorem 1.

We will mainly describe the algorithm and regret analysis for the setting of single parameter
$\mu = \mu_1 = \cdots = \mu_N$, i.e., the proof of Theorem 2. The algorithm and analysis for the setting with
different $\mu_i$s is similar. In Section 4.1 we describe the changes required for the latter setting in order
to get the result of Theorem 1.

1.2 Related Work 

The contextual bandit problem with linear payoffs is a widely studied problem in statistics and 
machine learning, often under different names as mentioned by Chu et al. [9]: bandit problems with
covariates [26, 21], associative reinforcement learning [14], associative bandit problems [H [23], and
bandit problems with expert advice [5]. The name contextual bandits was coined in Langford and 
Zhang [H]. 

Chu et al. [9] show that for any algorithm the regret is $\Omega(\sqrt{Td})$ for $d^2 \le T$ for the N-armed
contextual bandits problem with linear payoffs and single parameter. Auer [4] and Chu et al. [9]
analyze SupLinUCB, a complicated algorithm using UCB as a subroutine, for this problem. Chu et al.
achieve a regret bound of $O\left(\sqrt{Td \ln^3(NT\ln(T)/\delta)}\right)$ with probability at least $1 - \delta$ (Auer [4] proves
similar results). Let us compare these results with ours. Our bounds have a factor of d compared
to $\sqrt{d}$ in the bounds just mentioned. As can be observed in our regret analysis, the extra $\sqrt{d}$
factor in our bounds appears because we will use a concentration inequality that provides only a 

concentration of 0(\J d In ^) for the empirical estimate of the mean rewards around the actual mean. 
The advantage of using this (weaker though more generally applicable) concentration inequality is 
that it allows for statistical dependence between the samples used in the estimates of the mean 
rewards, which could be due to the dependence between the past rewards and the future choice of 
arms, or because the contexts are generated by an adaptive adversary. By contrast, in the analysis 
of SupLinUCB, [H |9] consider only oblivious adversary, and achieve statistical independence of 
samples by using a complicated master procedure SupLin on top of the basic UCB style algorithm. 

This allows them to use a stronger concentration of 0(ym~^) given by the Azuma-Hoeffding 
inequality. We do not use any such master algorithm. 

A closely related setting is that of the linear stochastic bandits problem, e.g. [10, 1]. In the linear
stochastic bandits problem, every arm i is associated with a known fixed vector $b_i$, and the expected
reward of the arm, when played, is $b_i^T \mu$, for some common unknown underlying parameter $\mu$. Abbasi-Yadkori
et al. [1] analyze a UCB-style algorithm for that problem. When adapted to our setting,
their regret bound is $O\left(d\log(T)\sqrt{T} + \sqrt{dT\log(T/\delta)}\right)$. Note that their regret bound does not
depend on N, and thus can even be applied to an infinite set of arms, for example, when the set
of arms is specified as those corresponding to all vectors in a d-dimensional polytope. The lower
bound for this setting was given by Dani et al. [10] as $\Omega(d\sqrt{T})$. The state-of-the-art bounds for
the linear bandits problem in the case of finite N are given by [6]. They provide an algorithm based on
exponential weights, with regret of order $\sqrt{dT\log N}$ for any finite set of N actions. However, their
setting is slightly different from ours. They consider a non-stochastic (adversarial) bandit setting
where the reward at time t for arm i is $b_i^T \mu_t$ with $\mu_t$ chosen by an adversary. The set of arms and
the associated $b_i$ vectors are non-adaptive and fixed in advance.

While the results in this paper do not claim to provide regret bounds for the Thompson Sampling
algorithm that match or better the best available bounds for this extensively studied problem,
our bounds for this natural and efficient heuristic are close to the best bounds. Our bounds are
essentially within a factor of $\sqrt{d}\ln T$ of the best bounds for finite N (those for UCB1 by [HI S],
and for Exp2 algorithm by [6]), and within a $\sqrt{\ln N}$ factor of the best bounds that do not depend on
N (by Abbasi-Yadkori et al. [1]). The main contribution of this paper is to provide tools for the
analysis of the Thompson Sampling algorithm for contextual bandits, which despite being popular
and empirically attractive, has eluded theoretical analysis. While significant recent progress was
made in analyzing it for basic MAB [3, 15], it was not clear how to extend that to the contextual
bandits problem, for which no regret bounds were available. There were considerable difficulties in 
extending the existing techniques to this case, some of which were also pointed out in [8]. We believe 
the techniques used in this paper will provide useful insights into the workings of this Bayesian 
algorithm, and may be useful for further improvements and extensions. 



2 Problem setting and algorithm description 
2.1 Problem setting 

There are N arms. At time t = 1, 2, ..., a context vector $b_i(t) \in \mathbb{R}^d$, $\|b_i(t)\| \le 1$, is revealed
for every arm i. These context vectors are chosen by an adversary in an adaptive manner after
observing the arms played and their rewards up to time t - 1, i.e. history $\mathcal{H}_{t-1}$,

$\mathcal{H}_{t-1} = \{i(w), r_{i(w)}(w), b_i(w), i = 1, \ldots, N, w = 1, \ldots, t-1\}$,

where i(t) denotes the arm played at time t. 

Given $b_i(t)$, the reward for arm i at time t is generated from an (unknown) distribution with mean
$b_i(t)^T \mu_i$, where $\mu_i \in \mathbb{R}^d$, $\|\mu_i\| \le 1$ are fixed but unknown parameters. Also, given history $\mathcal{H}_{t-1}$,
and $b_i(t), i = 1, \ldots, N$, the rewards for arms $i, i'$, $i \ne i'$ are independent of each other.

$\mathbb{E}\left[r_i(t) \mid \{b_i(t)\}_{i=1}^N, \mathcal{H}_{t-1}\right] = \mathbb{E}\left[r_i(t) \mid b_i(t)\right] = b_i(t)^T \mu_i.$

Furthermore, we assume that $\eta_{i,t} = r_i(t) - b_i(t)^T \mu_i$ is conditionally R-sub-Gaussian for a constant
$R \ge 0$, i.e.,

$\forall \lambda \in \mathbb{R}, \quad \mathbb{E}\left[e^{\lambda \eta_{i,t}} \mid \{b_i(t)\}_{i=1}^N, \mathcal{H}_{t-1}\right] \le \exp\left(\frac{\lambda^2 R^2}{2}\right).$

This assumption is satisfied if $r_i(t) \in [b_i(t)^T \mu_i - R,\ b_i(t)^T \mu_i + R]$ (refer to Remark 1 in Appendix
A.1 of [11]). Note that this assumption is weaker than assuming $r_i(t)$ is bounded.
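
As a quick check of the bounded-noise case, Hoeffding's lemma (a standard fact, quoted here for illustration and not taken from the paper) gives exactly the moment generating function bound required by the sub-Gaussian condition. Applied conditionally to zero-mean noise $\eta$ supported on $[-R, R]$:

```latex
\[
\eta \in [a, b],\ \mathbb{E}[\eta] = 0
\;\Longrightarrow\;
\mathbb{E}\left[e^{\lambda \eta}\right] \le \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right)
\quad \text{for all } \lambda \in \mathbb{R},
\]
\[
a = -R,\ b = R
\;\Longrightarrow\;
\mathbb{E}\left[e^{\lambda \eta}\right]
\le \exp\left(\frac{\lambda^2 (2R)^2}{8}\right)
= \exp\left(\frac{\lambda^2 R^2}{2}\right).
\]
```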

An algorithm for the contextual bandit problem needs to choose, at every time t, an arm i(t) to
play, using history $\mathcal{H}_{t-1}$ and current contexts $b_i(t), i = 1, \ldots, N$. Let $i^*(t)$ denote the optimal arm
at time t, i.e. $i^*(t) = \arg\max_i b_i(t)^T \mu_i$. Then the regret at time t is

$\text{regret}(t) = b_{i^*(t)}(t)^T \mu_{i^*(t)} - b_{i(t)}(t)^T \mu_{i(t)}.$

The objective is to minimize the total regret $\mathcal{R}(T) = \sum_{t=1}^T \text{regret}(t)$ in time T. The time horizon
T is finite but possibly unknown.
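
The regret definition can be made concrete with a small Python sketch. The function names (`regret_at`, `total_regret`) and the list-based representation of contexts and parameters are illustrative assumptions; in a real bandit run the parameters are unknown, so this quantity can only be computed in simulation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def regret_at(contexts, mus, played):
    """Regret of one round.

    contexts[i] is the context b_i(t) of arm i, mus[i] is the (simulated)
    true parameter mu_i, and played is the index i(t) of the arm chosen.
    """
    means = [dot(b, mu) for b, mu in zip(contexts, mus)]
    return max(means) - means[played]


def total_regret(rounds, mus):
    """Sum of per-round regrets; rounds is a list of (contexts, played) pairs."""
    return sum(regret_at(contexts, mus, played) for contexts, played in rounds)
```

Playing the arm with the highest expected reward in a round contributes zero to the total, so only suboptimal plays accumulate regret.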

Remark 3. An alternative definition of regret that appears in literature is

$\text{regret}(t) = r_{i^*(t)}(t) - r_{i(t)}(t).$

When the reward $r_i(t)$ at all time steps is bounded as $|r_i(t)| \le R$ for some constant R, for all i,
then we can obtain the same results as in Theorem 1 and Theorem 2 for this definition of regret.
The details are provided in Section 3.1.

2.2 Thompson Sampling algorithm 

Here, we describe the algorithm for the setting of single parameter $\mu = \mu_1 = \cdots = \mu_N$. (For the case
of N different parameters, see Section 4.1.) Since there is a single underlying parameter, TS will
maintain a common prior distribution over this parameter.

We use Gaussian likelihood function and Gaussian prior in our version of Thompson Sampling 
algorithm. More precisely, we assume that the likelihood of reward $r_i(t)$ at time t, given context
$b_i(t)$ and parameter $\mu$, is given by the pdf of Gaussian distribution $\mathcal{N}(b_i(t)^T \mu, v^2)$. Here, v =

R\J §dln( j), with e £ (0, 1) which parameterizes our algorithm. Let 

$B(t) = I_d + \sum_{w=1}^{t-1} b_{i(w)}(w) b_{i(w)}(w)^T; \qquad \hat{\mu}(t) = B(t)^{-1} \left( \sum_{w=1}^{t-1} b_{i(w)}(w) r_{i(w)}(w) \right).$

Then, assuming that the prior for $\mu$ at time t is given by $\mathcal{N}(\hat{\mu}(t), v^2 B(t)^{-1})$, it is easy to compute
the posterior distribution

$\Pr(\tilde{\mu} \mid r_i(t)) \propto \Pr(r_i(t) \mid b_i(t)^T \tilde{\mu}) \Pr(\tilde{\mu})$

as $\mathcal{N}(\hat{\mu}(t+1), v^2 B(t+1)^{-1})$ (details of this computation are in Appendix A). Or, equivalently, the
posterior distribution of the mean reward $b_i(t+1)^T \mu$ for arm i is $\mathcal{N}(b_i(t+1)^T \hat{\mu}(t+1), v^2 b_i(t+1)^T B(t+1)^{-1} b_i(t+1))$.
In our Thompson Sampling algorithm, for each arm i, we will generate an
independent sample $\theta_i(t)$ from the distribution $\mathcal{N}(b_i(t)^T \hat{\mu}(t), v^2 b_i(t)^T B(t)^{-1} b_i(t))$ at time t. The
arm with maximum value of $\theta_i(t)$ will be played.
Algorithm 1: Thompson Sampling for Contextual bandits
Set $B = I_d$, $\hat{\mu} = 0_d$, $f = 0_d$.
foreach t = 1, 2, ..., do

    For each arm i = 1, ..., N, sample $\theta_i(t)$ independently from distribution
    $\mathcal{N}(b_i(t)^T \hat{\mu}, v^2 b_i(t)^T B^{-1} b_i(t))$.

    Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $r_t$.
    Update $B = B + b_{i(t)}(t) b_{i(t)}(t)^T$, $f = f + b_{i(t)}(t) r_t$, $\hat{\mu} = B^{-1} f$.
end
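
A minimal Python sketch of Algorithm 1 follows, under stated assumptions. The class and method names (`LinearThompsonSampler`, `choose`, `update`) are illustrative and not from the paper. The scale `v` is passed in by the caller rather than computed from R, d and the other quantities in its definition. The inverse of B is maintained with the Sherman-Morrison rank-one update instead of being recomputed each round, which is an implementation choice and not part of the algorithm as stated.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, x):
    return [dot(row, x) for row in M]


class LinearThompsonSampler:
    """Sketch of Thompson Sampling with a single shared parameter."""

    def __init__(self, d, v, rng=None):
        self.d = d
        self.v = v  # scale of the sampling distribution (assumed given)
        # B = I_d, stored through its inverse
        self.B_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.f = [0.0] * d       # f = 0_d
        self.mu_hat = [0.0] * d  # mu_hat = 0_d
        self.rng = rng or random.Random()

    def choose(self, contexts):
        """Sample theta_i ~ N(b_i^T mu_hat, v^2 b_i^T B^{-1} b_i) per arm; return the argmax."""
        best_arm, best_theta = 0, -math.inf
        for i, b in enumerate(contexts):
            mean = dot(b, self.mu_hat)
            var = max(dot(b, mat_vec(self.B_inv, b)), 0.0)  # s_{t,i}^2
            theta = self.rng.gauss(mean, self.v * math.sqrt(var))
            if theta > best_theta:
                best_arm, best_theta = i, theta
        return best_arm

    def update(self, b, reward):
        """Apply B += b b^T, f += b * reward, mu_hat = B^{-1} f."""
        Bb = mat_vec(self.B_inv, b)
        denom = 1.0 + dot(b, Bb)
        # Sherman-Morrison: (B + b b^T)^{-1} = B^{-1} - (B^{-1} b)(B^{-1} b)^T / (1 + b^T B^{-1} b)
        for i in range(self.d):
            for j in range(self.d):
                self.B_inv[i][j] -= Bb[i] * Bb[j] / denom
        self.f = [fi + bi * reward for fi, bi in zip(self.f, b)]
        self.mu_hat = mat_vec(self.B_inv, self.f)
```

In use, each round calls `choose` with the list of current contexts, plays the returned arm, and then calls `update` with that arm's context and the observed reward. Each update costs $O(d^2)$ in this sketch, and each sampling step costs $O(Nd^2)$.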



Remark 4. Note that in the case of a single underlying parameter $\mu$, one could alternatively first
generate a single $\tilde{\mu}$ from distribution $\mathcal{N}(\hat{\mu}(t+1), v^2 B(t+1)^{-1})$ and then generate $\theta_i(t)$ as $b_i(t)^T \tilde{\mu}$.

This alternative algorithm could be more efficient if N is large, but there is an efficient way to
compute $\max_i b_i(t)^T \tilde{\mu}$. While in this alternative algorithm, the marginal distribution of each $\theta_i(t)$
remains $\mathcal{N}(b_i(t+1)^T \hat{\mu}(t+1), v^2 b_i(t+1)^T B(t+1)^{-1} b_i(t+1))$, the $\theta_i(t)$s are not independent anymore.
In our proof, we utilize the independence of $\theta_i(t)$s (in Lemma 2), and it is not clear to us at this
point whether the algorithm with dependent $\theta_i$s will have the same regret.

For the case of N different parameters, this distinction is not important, as a separate $\tilde{\mu}_i$ has to be
generated for every i.

2.3 Challenges and solution outline

The contextual version of the multi-armed bandit problem presents new challenges for the analysis 
of TS algorithm, and the techniques used so far for analyzing the basic multi-armed bandit problem 
by [3, 15] do not seem directly applicable. Let us describe some of these difficulties and our novel
solution ideas to resolve them. 

In the basic MAB problem there are N arms, each with mean reward $\mu_i$, and the regret for
playing a suboptimal arm i is $\mu_{i^*} - \mu_i$, where $i^*$ is the arm with highest mean. Let us compare this
to a 1-dimensional contextual MAB problem, where each arm i is associated with a parameter $\mu_i$,
but in addition, at every time t, it is associated with a context $b_i(t)$, so that mean reward is $b_i(t)\mu_i$,
the best arm $i^*(t)$ at time t is the arm with the highest mean at time t, and the regret for playing
arm i is $b_{i^*(t)}(t)\mu_{i^*(t)} - b_i(t)\mu_i$.

In general, the basis of regret analysis for stochastic MAB is to prove that the variance of 
empirical estimates for all arms decreases fast enough, so that the regret incurred until the variance 
becomes small enough, is small. In the basic MAB, the variance of the empirical mean is inversely 
proportional to the number of plays $k_i(t)$ of arm i at time t. Thus, every time a suboptimal
arm is played, we know that even though a regret of $\mu_{i^*} - \mu_i \le 1$ is incurred, there is also an
improvement of exactly 1 in the number of plays of that arm, and hence, corresponding decrease 
in the variance. The techniques for analyzing basic MAB rely on this observation to precisely 
quantify the exploration-exploitation tradeoff. On the other hand, the variance of empirical mean 
for contextual case is given by inverse of Bi(t) = Ylu=i bu{t) 2 - When a suboptimal arm i is played, 
if bi(t) is small, the regret &i*(t)(i)/Uj*(t) — bi(t)/j,i could be much higher than the improvement bi(t) 
in Bi(t). 

In our solution, we overcome this difficulty by bounding the expected regret at every step by 
a function of the probability of playing the optimal arm at that step. So, a high expected regret 
would mean large expected number of plays of optimal arm, in turn implying that regret is small. 
More precisely, we prove that, for "most histories" $\mathcal{F}_{t-1}$,

^—^^Efregret^l^-!] < ~ Pr (i(t) = i*(t) \ F t ^) s^ (t) + s tm . 

This inequality will form the basis for establishing our super-martingale process. Here filtration
$\mathcal{F}_{t-1}$ will be defined as the union of history $\mathcal{H}_{t-1}$ and the contexts $b_i(t), i = 1, \ldots, N$ at time t.



And, p = t(T) = R^dHir) + !> S M = vW^PRt)- 

The main idea behind proving the above inequality is to divide the arms into two groups at any 
given time : 

• unsaturated arms defined as those with $\Delta_i(t) := b_{i^*(t)}(t)^T \mu - b_i(t)^T \mu \le \left(\sqrt{4\ln(NT)}\, v + \ell(T)\right) s_{t,i}$,

• saturated arms defined as those with $\Delta_i(t) > \left(\sqrt{4\ln(NT)}\, v + \ell(T)\right) s_{t,i}$.

Note that $s_{t,i}$ gives the standard deviation of the estimate $b_i(t)^T \hat{\mu}(t)$ and of $\theta_i(t)$. Thus, intuitively
saturated arms are arms with the property that the estimates of the means constructed so far in
the direction of their contexts are good, making the deviations $s_{t,i}$ small enough, that is, significantly
smaller than their current $\Delta_i(t)$.
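
The split into saturated and unsaturated arms can be sketched in Python as follows. The function names are illustrative, and the value of $\ell(T)$ is passed in as a plain number `ell` because its exact expression is defined elsewhere in the paper. The gaps $\Delta_i(t)$ depend on the unknown parameter, so this classification is an analysis device that can only be evaluated in simulation.

```python
import math


def g_threshold(N, T, v, ell):
    """Multiplier sqrt(4 ln(NT)) * v + ell(T) used in the saturation test."""
    return math.sqrt(4.0 * math.log(N * T)) * v + ell


def is_saturated(delta_i, s_ti, g):
    """An arm is saturated when its gap exceeds g times its deviation s_{t,i}."""
    return delta_i > g * s_ti


def split_arms(deltas, devs, g):
    """Return (saturated, unsaturated) index lists for one time step."""
    saturated = [i for i, (dl, s) in enumerate(zip(deltas, devs)) if is_saturated(dl, s, g)]
    unsaturated = [i for i in range(len(deltas)) if i not in saturated]
    return saturated, unsaturated
```

Since the optimal arm has gap zero and deviations are non-negative, it always lands in the unsaturated list.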

If an unsaturated arm is played at time t, then regret is at most $\Delta_{i(t)}(t) \le \left(\sqrt{4\ln(NT)}\, v + \ell(T)\right) s_{t,i(t)}$.
For saturated arms, the regret can be large, but on the other hand, since their deviation
$s_{t,i}$ is small, the concentration of $\theta_i(t)$ and $b_i(t)^T \hat{\mu}(t)$ will ensure that with reasonable probability,
the algorithm is able to distinguish between them and the optimal arm. In particular, we prove
that the probability of playing a saturated arm is within p of the probability of playing the optimal
arm. Further, using concentration bounds for $\theta_i(t)$ and $\hat{\mu}(t)$, the regret at any time can be bounded
by $\left(\sqrt{4\ln(NT)}\, v + \ell(T)\right)\left(s_{t,i^*(t)} + s_{t,i(t)}\right)$ to get the desired inequality.

Then, using the Azuma-Hoeffding inequality for super-martingales, it will follow that with high 
probability, 

K{T) = ELi regret(t) < {y/4\n(NT) v + £{T)) (| £f =1 = f(t)) St)i . (t) + £t s M(t) ) 

< (V4ln(iVr) i; + £(T)) (j £ t *t,i(t) + Et «*,<(*)) 

Then, we will use the inequality $\sum_t s_{t,i(t)} = \tilde{O}(\sqrt{Td})$ (derived along the lines of [1]), to get the
desired regret bound.
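
The behaviour of the sum of deviations can be illustrated numerically. The Python sketch below draws random unit-norm contexts (an assumption made purely for illustration; the paper allows adversarial contexts), accumulates $s_t = \sqrt{b^T B^{-1} b}$ with B updated after each round, and compares the total with $5\sqrt{dT\ln T}$, the explicit form of this bound quoted in the proof of Theorem 2. The function names are hypothetical, and a single simulated run is a sanity check, not a proof.

```python
import math
import random


def sum_of_deviations(d, T, seed=0):
    """Sum over T rounds of sqrt(b^T B^{-1} b) for random unit-norm contexts b."""
    rng = random.Random(seed)
    B_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # B = I_d
    total = 0.0
    for _ in range(T):
        b = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in b))
        b = [x / norm for x in b]
        Bb = [sum(B_inv[i][j] * b[j] for j in range(d)) for i in range(d)]
        s2 = max(sum(bi * x for bi, x in zip(b, Bb)), 0.0)
        total += math.sqrt(s2)
        # Sherman-Morrison update of B^{-1} after adding b b^T to B
        for i in range(d):
            for j in range(d):
                B_inv[i][j] -= Bb[i] * Bb[j] / (1.0 + s2)
    return total


def reference_bound(d, T):
    """The value 5 * sqrt(d * T * ln T) used as a comparison point."""
    return 5.0 * math.sqrt(d * T * math.log(T))
```

Because each deviation is at most 1, the sum is trivially at most T; the comparison is only informative once T is large enough that `reference_bound(d, T)` falls below T.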



3 Regret Analysis: Proof of Theorem 2



Definition 1. Define £(T) = R^dln^) + 1, v = RJ^dln(^). And for all i, define s tji = 

$\sqrt{b_i(t)^T B(t)^{-1} b_i(t)}$, $\Delta_i(t) = b_{i^*(t)}(t)^T \mu - b_i(t)^T \mu$. Also define filtration $\mathcal{F}_t$ as the union of history
until time t, and the contexts at time t + 1, i.e., $\mathcal{F}_t = \{\mathcal{H}_t, b_i(t+1), i = 1, \ldots, N\}$.



Definition 2. An arm i is called saturated at time t if $\Delta_i(t) > \left(\sqrt{4\ln(NT)}\, v + \ell(T)\right) s_{t,i}$, and
unsaturated otherwise. Let C(t) denote the set of saturated arms at time t. Note that the optimal
arm at time t is always unsaturated at time t, i.e., $i^*(t) \notin C(t)$, and an arm may keep shifting from
saturated to unsaturated and vice-versa over time.

Definition 3. Define $E(t)$ and $\tilde{E}(t)$ as the events that $b_i(t)^T \hat{\mu}(t)$ and $\theta_i(t)$ are concentrated around
their respective means. More precisely, define $E(t)$ as the event that

$\forall i : |b_i(t)^T \hat{\mu}(t) - b_i(t)^T \mu| \le \ell(T) s_{t,i}.$

Define $\tilde{E}_i(t)$ as the event that

$|\theta_i(t) - b_i(t)^T \hat{\mu}(t)| \le \sqrt{4\ln(NT)}\, v s_{t,i},$
and $\tilde{E}(t)$ as the event that $\forall i$, $\tilde{E}_i(t)$ holds.

Lemma 1. For all t, < 5 < 1, Pr(E(t)) > 1 — And, for all possible filtrations Ft—i, 
Vt,Pr(E i (t)|.F t _ 1 ) > 1 - jfcs, and Pr(E(t)\F t ^) > 1 - ^. 

Proof. The complete proof of this lemma appears in Appendix B.2. The probability bound for the
concentration of $b_i(t)^T \hat{\mu}(t)$ will be proven using the concentration inequality given by Theorem 1 in [1]. The probability
bound for the concentration of each $\theta_i(t)$ will be proven using a concentration inequality for Gaussian random variables from
[2], stated in Appendix B.1. □
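
To get a feel for the threshold $\sqrt{4\ln(NT)}$ used for the Gaussian samples, the Python sketch below compares the exact two-sided tail of a standard normal with the standard bound $\Pr(|Z| > z) \le e^{-z^2/2}$, which at this threshold equals $(NT)^{-2}$. This bound is a textbook fact used here for illustration; it is not claimed to be the exact inequality from [2] that the proof relies on, and the function names are hypothetical.

```python
import math


def two_sided_gaussian_tail(z):
    """Exact Pr(|Z| > z) for a standard normal Z, via the complementary error function."""
    return math.erfc(z / math.sqrt(2.0))


def tail_check(N, T):
    """Return (exact tail, exp(-z^2/2)) at z = sqrt(4 ln(NT))."""
    z = math.sqrt(4.0 * math.log(N * T))
    exact = two_sided_gaussian_tail(z)
    simple_bound = math.exp(-z * z / 2.0)  # equals (N*T) ** -2
    return exact, simple_bound
```

A union bound over the N arms then suggests why a per-arm failure probability of this order is small enough to be summed over all arms and all T rounds.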

Definition 4. Recall that regret(t) was defined as the regret at time t, $\text{regret}(t) = b_{i^*(t)}(t)^T \mu - b_{i(t)}(t)^T \mu$.
Define $\text{regret}'(t) = \text{regret}(t) \cdot I(E(t))$.

Next, we establish a super-martingale process that will form the basis of proving our high- 
probability regret bound. 

Definition 5. Let 
and $Y_t := \sum_{w=1}^{t} X_w$, where p =

Lemma 2. $(Y_t; t \ge 0)$ is a super-martingale process with respect to filtration $\mathcal{F}_t$.
Proof. We need to prove that for all $t \in [1, T]$,

-E[regret'(i)|^i] < Pr (<(*) = <*(*) I ^l) + fi j- j ^ H 



( v / 41ri(iVT) t; + £(T)) P ' L MW 1 ^ ' pT 2 ' 



Let $g(T) = \sqrt{4\ln(NT)}\, v + \ell(T)$. Note that whether $E(t)$ is true or not is completely determined
by $\mathcal{F}_{t-1}$. If $\mathcal{F}_{t-1}$ is such that $E(t)$ does not hold, then $\text{regret}'(t) = \text{regret}(t) \cdot I(E(t)) = 0$,
and the lemma holds trivially. So, we will prove the above lemma while assuming we are given an
$\mathcal{F}_{t-1}$ such that $E(t)$ holds.

Let $E^s(t)$ denote the event that some (by definition suboptimal) saturated arm i in C(t) exceeds
all the suboptimal unsaturated arms at time t, i.e.,

$E^s(t) : \exists i \in C(t)$, such that $\forall j \notin C(t), j \ne i^*(t)$, $\theta_i(t) > \theta_j(t)$.
We prove the following lower bound on the probability of playing the optimal arm,

Pr (t(t) = i*(t) | JU) > pPr (i? s (i) | E{t),F t - X ) - 

And we prove that 

^yEfregret'^l^-i] < Pr (E s (t) | E(t),F t ^) s t>i * {t) + E [s tm \ T t -i] + 

to get the desired inequality. 

For the lower bound on Pr (i(i) = | ^t-i), 

Pr(i(i) = i*(i)|Ji_ 1 ) > Pr(i(t) = i*(t),^ s (t),^(t)|Ji-i) 

= Pr(i(i) = | E s (t),E{t),T t -i) ■ Pr(E,(t) \ E{t),F t ^) ■ Pr(E{t) \ T t -i) 

(1) 

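Now, given $\tilde{E}(t)$ (the event that every sample $\theta_i(t)$ is concentrated around its mean), and $\mathcal{F}_{t-1}$ such that $E(t)$ is true, using the definition of saturated arms, it holds
that for all $i \in C(t)$,

$\theta_i(t) \le b_i(t)^T \mu + g(T) s_{t,i} \le b_i(t)^T \mu + \Delta_i(t) = b_{i^*(t)}(t)^T \mu,$
and given $E^s(t)$, it holds that there exists $j \in C(t)$ with

$\max_{i \notin C(t),\, i \ne i^*(t)} \theta_i(t) \le \theta_j(t) \le b_j(t)^T \mu + \Delta_j(t) = b_{i^*(t)}(t)^T \mu,$
so that for all arms $i \ne i^*(t)$, $\theta_i(t) \le b_{i^*(t)}(t)^T \mu$. Therefore,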

Pr(*(t) = i*(t) | E s (t),E(t),T t -i) > Pr(^ (i) (t) > &i*(t)(t) T /i | E a (t),E(t),F t -i) 



Pr((?i. ( i)(t) > 6i.(t)(t)V|%)W^t-i) 

l 

— 



> Pr(e i . (t) (<)>6 i . (t) (t) T /*|7i_i) 



The equality above holds because the events $E^s(t)$ and the concentration events $\tilde{E}_i(t), \forall i \ne i^*(t)$, of the samples do not concern the optimal
arm, and given $\mathcal{F}_{t-1}$ (and hence $i^*(t)$, $b_{i^*(t)}(t)$, $\hat{\mu}(t)$, and $B(t)$), $\theta_{i^*(t)}(t)$ is independent of these
events. For the last inequality, we use that for any two events A, B, $\Pr(A) \le \Pr(A \mid B) + \Pr(\bar{B})$.

In Lemma 3 we prove a lower bound of p on the probability of $\theta_{i^*(t)}(t)$ to exceed the optimal
mean reward $b_{i^*(t)}(t)^T \mu$ given $\mathcal{F}_{t-1}$ such that $E(t)$ holds. This will be proven using concentration
provided by $E(t)$ and anti-concentration of Gaussian random variable $\theta_{i^*(t)}(t)$. Using Lemma 3,



S 



Pr(i(t) = i*(t)\E s (t),E(t),F t ^)>p-^ 
Substituting this along with Pr [E(t) Tt-\ ) > 1 — ^ in Equation ([T]), we get 



Pr (i(t) = i*(t) | Tt-i) > pPr[E s (t) Ei(t),F t 



't-i 



2 



(2) 



For the regret upper bound, we observe that given $\tilde{E}(t)$ (the concentration event for the samples $\theta_i(t)$), and $\mathcal{F}_{t-1}$ such that $E(t)$ holds, if an
arm i is played at time t, then $\Delta_i(t) \le g(T)(s_{t,i} + s_{t,i^*(t)})$. This holds because if an arm i is played
at time t, then it must be true that $\theta_i(t) \ge \theta_{i^*(t)}(t)$. And, given $\tilde{E}(t)$ and $E(t)$,

$b_i(t)^T \mu \ge \theta_i(t) - g(T) s_{t,i}$

$\ge \theta_{i^*(t)}(t) - g(T) s_{t,i}$

$\ge b_{i^*(t)}(t)^T \mu - g(T) s_{t,i^*(t)} - g(T) s_{t,i}.$

Also, by definition of unsaturated arms, for every unsaturated arm i, $\Delta_i(t) \le g(T) s_{t,i}$. Therefore,

E [regret' (t) | F t -i] < E fe i6C(t ) *i(t)I(i = i(t)) Jul + E fei^t),^* (t) sCO^i = i(t)) 



i-l 



(*) < ®\Y,ieC(t)Mt)I(i = m E{t),F t A+^ + g{T)¥.[s m) I{i{t)tC{t))\T t _ l ] 



< E 



{g(T)s tji * (t) + g(T)s tm ) l(i(t) e C(t)) E(t),F t 
+g(T)E [s tm I{i(t)^C{t))\F t ^] 



t-i 



+ 



T 1 



(**) < g(T)s t>iHty E 



I{i{t) G C(t)) E(t),.Ft-iJ + ^ +5(T)E [s M(t) | (1 - 

< (g(T)s t)i * {t) ) Pr (E s | £(t), + ^ + g(T)E [a t|i(t) \ F t -i] (1 + 

< («/(T)a M . (t )) Pr E(t) ) J c i_i)+^+^(r)E[ atji(t) |JVi]+ 3 $ i (3) 



For the inequality marked (*), we use that for any random variable A < 1, event -B, and F, K[A\F] < 
E[A\B,F] + 1 - Pr(B\F). We use this with A = £ ieC(t) A,(t)/(i = i(t)) < £ ieC(t) ll^(t)(*)ll " 
llAWt) = < 1) ^ — E{t),F = J^-i. For the inequality marked (**), we use that for any 
events A,B,F, Pr(A\F) > Pr(A|i?, F) Pv(B\F). We use this with A = I(i(t) <£ C(t)),B = E(t), 
F = F t -\. For the last inequality, we use that s ti ^ < \\bi^{t)\\ < 1. □ 

The next lemma lower bounds the probability that the sample $\theta_{i^*(t)}(t)$ of the optimal arm at
time t will exceed its mean reward.



Lemma 3. For any filtration Ft-\ such that E(t) is true, 

Pr (0i. (t )(t) > bi^tfn \jF t -i) 



> 



2e\ArT e 

Proof. Given event E(t), \b i *^(t) T fi(t) — b i *^{t) T n\ < £(T)s tji *( t y And, since Gaussian random 
variable #j*( t )(i) has mean bi*^(t) T p,(t) and standard deviation vs t ^*^, using anti-concentration 
inequality in Lemma 01 we will prove that with probability at least - J TE , $i*(t)(0 will exceed 
bi*(t)(t) T £i(t) + (in y) vs t Then, the proof will follow from observing that bi*^(t) T fi(t) + 



(in 21) > b i *(j){t) T fi(t) + l(T)s t ^(£} > (t) T /x. The details of the proof are in Appendix 



□ 



Now, we are ready to prove Theorem 2.

3.1 Proof of Theorem 2

Note that for super-martingale $Y_t$,

\Y t - Y t -i\ = \X t \ = \ WiHN i )v+m) regret'(t) - = .-(t)K„ (t) - ^ - ^| < 



The last inequality holds because for any i, $s_{t,i} = \sqrt{b_i(t)^T B(t)^{-1} b_i(t)} \le \|b_i(t)\|_2 \le 1$. Therefore,
by Azuma-Hoeffding inequality, 

Pr (Yt -Y > l^/TUm) < exp (^M) < 5/2. 



Therefore with probability 1 



2' 



Also, 



p Ei=l = + £t=l s M(i) + J? = p X)t:i(t)=t*(t) S *,i*(*) + Et=l s *,i(t) + pT 

^ | Et=i s *,i(*) + Ya=i s t,i(t) + 



= o(VrVrdinT) 

For the last inequality, we use that $\sum_{t=1}^T s_{t,i(t)} \le 5\sqrt{dT \ln T}$, which can be derived along the lines
of Lemma 3 of [9] using Lemma 11 of [3]. Details are in Appendix B.3. Therefore, with probability

l- £ 

1 2' 



l£=i regret'(t) < (^Aln(NT) v + t{T)) ■ [O(VT^VTdlnT) + ^/Tln( 



[ d A /^lniVlnTlni 



Also, because $E(t)$ holds for all t with probability at least $1 - \frac{\delta}{2}$ (refer to Lemma 1), we have
$\text{regret}'(t) = \text{regret}(t)$ for all t with probability at least $1 - \frac{\delta}{2}$. Hence, with probability $1 - \delta$,

K{T) = Zl=i regret(i) = (dy^ In AT In T In ^ . 

To obtain bounds for the other definition of regret in Remark 3, observe that the expected
regret for this definition is the same as before,

$\mathbb{E}[\text{regret}(t)] = \mathbb{E}[r_{i^*(t)}(t) - r_{i(t)}(t)] = \mathbb{E}[\mathbb{E}[r_{i^*(t)}(t) \mid i^*(t)]] - \mathbb{E}[\mathbb{E}[r_{i(t)}(t) \mid i(t)]] = \mathbb{E}[b_{i^*(t)}(t)^T \mu - b_{i(t)}(t)^T \mu].$

Therefore, Lemma 2 holds as it is, and $Y_t$ defined in Definition 5 is a super-martingale with respect
to this new definition of regret (t) as well. Now, if |rj(i)| < R for all i, then | regret' (t)\ < R and 
\Yt — Yt—i\ < | and we can apply Azuma-Hoeffding inequality exactly as in this subsection to 
obtain regret bounds of the same order as Theorem [5] for the new definition. 
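
For reference, the Azuma-Hoeffding step used in this subsection relies on the following standard form of the inequality. This is a sketch in generic notation: the symbols $c_t$, $c$ and $a$ are placeholders introduced here, not quantities defined in the paper, and $c$ would stand for whatever bound on $|Y_t - Y_{t-1}|$ the proof establishes.

```latex
% Azuma-Hoeffding inequality for a super-martingale (Y_t; t >= 0)
% with bounded differences |Y_t - Y_{t-1}| \le c_t for t = 1, ..., T:
\Pr\left( Y_T - Y_0 \ge a \right)
  \le \exp\left( -\frac{a^2}{2\sum_{t=1}^{T} c_t^2} \right),
  \qquad a > 0.
% With a common bound c_t = c, the choice a = c\sqrt{2T\ln(2/\delta)}
% makes the right-hand side equal to \delta/2.
```

With a uniform bound on the increments, the deviation term therefore grows as $\sqrt{T\ln(2/\delta)}$ times that bound.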
4 Extensions

4.1 N different parameters

Theorem 1 considers the setting where each arm $i$ is associated with a parameter $\mu_i$, where possibly
$\mu_i \ne \mu_{i'}$ for two different arms $i$ and $i'$. In this case, Thompson Sampling would maintain a separate
estimate of the mean $\hat\mu_i(t)$, and a separate $B_i(t)$, for each arm $i$, which would be updated only at the time
instances when $i$ is played. The statements of Lemma 1 and the super-martingale property established by
Lemma 2 will hold as they are for the new definitions. The only difference will appear in the bound
for $\sum_t s_{t,i(t)}$ used in the proof of Theorem 2. For the case of $N$ different parameters, we will get a
bound of $O(\sqrt{NTd\ln T})$ on this quantity instead of $O(\sqrt{Td\ln T})$, leading to the extra $\sqrt{N}$ factor
in the bound in Theorem 1 compared to Theorem 2. The details of the algorithm for the case of $N$
different parameters, and the changes in the analysis required for proving Theorem 1, are provided
in Appendix D.

4.2 General distributions 

In the algorithm in this paper, $\theta_i(t)$ is generated from a Gaussian distribution. However, the
analysis techniques in this paper are easily extendable to an algorithm that uses a posterior
distribution other than the Gaussian distribution. The only distribution-specific properties we have used
in the analysis are the concentration and anti-concentration inequalities for Gaussian distributed
random variables mentioned in Lemma 4. The concentration inequality was used to prove that
$E^\theta(t)$ happens with high probability in Lemma 1, and the anti-concentration inequality was used to
lower bound the probability that the Gaussian distributed random variable $\theta_{i^*(t)}(t)$ exceeds its mean
by some factor of its standard deviation in Lemma 3. If any other distribution provides similar
tail inequalities, these inequalities can be used as a black box in the analysis, and the regret bounds
can be reproduced for that distribution.

5 Conclusions 

We provided a theoretical analysis of Thompson Sampling for the contextual bandits problem with
linear payoffs. Our results resolve many open questions regarding the theoretical guarantees for
Thompson Sampling, and establish that even for the contextual version of the stochastic MAB
problem, TS achieves regret bounds comparable to the state-of-the-art methods. We used novel
martingale-based analysis techniques which are simpler than those in the past work on TS [3, 15],
and amenable to extensions. In fact, the techniques introduced in this paper could also be used to
provide a simpler proof for the optimal expected regret bounds for TS for the basic MAB problem
studied in [3, 15]. The proof of this claim will appear elsewhere.

Several questions remain open. A tighter analysis that can remove the dependence on $\epsilon$ is
desirable. We believe that our techniques would adapt to provide such bounds for the expected
regret. Other avenues to explore are contextual bandits with generalized linear models considered
in [11], the setting with delayed and batched feedback, and the agnostic case of contextual bandits
with linear payoffs. The agnostic case refers to the setting which does not make the realizability
assumption that there exists a vector $\mu_i$ for each $i$ for which $E[r_i(t)\,|\,b_i(t)] = b_i(t)^T\mu_i$. To our
knowledge, no existing algorithm has been shown to have non-trivial regret bounds for the agnostic
case.
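A Posterior distribution computation

$$
\begin{aligned}
\Pr(\tilde\mu \mid r_i(t)) &\propto \Pr(r_i(t) \mid b_i(t)^T\tilde\mu)\,\Pr(\tilde\mu) \\
&\propto \exp\left\{-\frac{1}{2v^2}\Big((r_i(t) - \tilde\mu^T b_i(t))^2 + (\tilde\mu - \hat\mu(t))^T B(t)(\tilde\mu - \hat\mu(t))\Big)\right\} \\
&\propto \exp\left\{-\frac{1}{2v^2}\Big(r_i(t)^2 + \tilde\mu^T b_i(t)b_i(t)^T\tilde\mu + \tilde\mu^T B(t)\tilde\mu - 2\tilde\mu^T b_i(t)r_i(t) - 2\tilde\mu^T B(t)\hat\mu(t)\Big)\right\} \\
&\propto \exp\left\{-\frac{1}{2v^2}\Big(\tilde\mu^T B(t+1)\tilde\mu - 2\tilde\mu^T B(t+1)\hat\mu(t+1)\Big)\right\} \\
&\propto \exp\left\{-\frac{1}{2v^2}(\tilde\mu - \hat\mu(t+1))^T B(t+1)(\tilde\mu - \hat\mu(t+1))\right\} \\
&\propto \mathcal{N}\big(\hat\mu(t+1),\, v^2 B(t+1)^{-1}\big).
\end{aligned}
$$

Therefore, the posterior distribution of $\tilde\mu$ at time $t+1$ is $\mathcal{N}(\hat\mu(t+1),\, v^2 B(t+1)^{-1})$.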
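
The chain of proportionalities above can be checked numerically. The following is a minimal sketch, not code from the paper. It assumes the updates $B(t+1) = B(t) + b_i(t)b_i(t)^T$ and $\hat\mu(t+1) = B(t+1)^{-1}(B(t)\hat\mu(t) + b_i(t)r_i(t))$, and all function and variable names are hypothetical. Because the two expressions are proportional, the difference of their logarithms between any two points must agree.

```python
import numpy as np

def posterior_update(B, mu_hat, b, r):
    """Return (B(t+1), mu_hat(t+1)) after observing reward r for context b."""
    B_next = B + np.outer(b, b)
    # B(t) mu_hat(t) + b r is the running sum of context-weighted rewards.
    mu_next = np.linalg.solve(B_next, B @ mu_hat + b * r)
    return B_next, mu_next

def log_prior_times_likelihood(mu, B, mu_hat, b, r, v):
    """Log of likelihood times prior, up to an additive constant."""
    return -((r - mu @ b) ** 2 + (mu - mu_hat) @ B @ (mu - mu_hat)) / (2 * v**2)

def log_gaussian_kernel(mu, mean, B, v):
    """Log density of N(mean, v^2 B^{-1}), up to an additive constant."""
    return -((mu - mean) @ B @ (mu - mean)) / (2 * v**2)

rng = np.random.default_rng(0)
d, v = 3, 0.7  # illustrative dimension and scale (assumptions)
B, mu_hat = np.eye(d), np.zeros(d)

# One warm-up update so that the prior is not the trivial N(0, v^2 I).
B, mu_hat = posterior_update(B, mu_hat, rng.normal(size=d), 0.3)

b, r = rng.normal(size=d), -0.8
B_next, mu_next = posterior_update(B, mu_hat, b, r)

# Compare log-density differences at two arbitrary points x and y.
x, y = rng.normal(size=d), rng.normal(size=d)
lhs = (log_prior_times_likelihood(x, B, mu_hat, b, r, v)
       - log_prior_times_likelihood(y, B, mu_hat, b, r, v))
rhs = (log_gaussian_kernel(x, mu_next, B_next, v)
       - log_gaussian_kernel(y, mu_next, B_next, v))
gap = abs(lhs - rhs)
print(gap)
```

The printed gap should be at the level of floating-point round-off, reflecting that the terms dropped in the derivation do not depend on $\tilde\mu$.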

B Proof of Theorem 2

B.1 Gaussian concentration

Formula 7.1.13 from [2] can be used to derive the following concentration and anti-concentration
inequalities for Gaussian distributed random variables.

Lemma 4. [2] For a Gaussian distributed random variable $Z$ with mean $m$ and variance $\sigma^2$, for
any $z \ge 1$,

$$\frac{1}{2\sqrt{\pi}\,z}\,e^{-z^2/2} \le \Pr(|Z - m| > z\sigma) \le \frac{1}{\sqrt{\pi}\,z}\,e^{-z^2/2}.$$

B.2 Proof of Lemma 1

We will use the following lemma (implied by Theorem 1 in [1]):

Lemma 5. [1] Let $(\mathcal{F}'_t; t \ge 0)$ be a filtration, $(m_t; t \ge 1)$ be an $\mathbb{R}^d$-valued stochastic process such
that $m_t$ is $(\mathcal{F}'_{t-1})$-measurable, and $(\eta_t; t \ge 1)$ be a real-valued martingale difference process such that $\eta_t$
is $(\mathcal{F}'_t)$-measurable. For $t \ge 0$, define $\xi_t = \sum_{u=1}^{t} m_u\eta_u$ and $M_t = I_d + \sum_{u=1}^{t} m_u m_u^T$, where $I_d$ is
the $d$-dimensional identity matrix. Assume $\eta_t$ is conditionally $R$-sub-Gaussian.
Then, for any $\delta' > 0$, $t \ge 0$, with probability at least $1 - \delta'$,

$$\|\xi_t\|_{M_t^{-1}} \le R\sqrt{d\ln\left(\frac{t+1}{\delta'}\right)}.$$



We use the above lemma with $m_t = b_{i(t)}(t)$, $\eta_t = r_{i(t)}(t) - b_{i(t)}(t)^T\mu$, $\mathcal{F}'_t = (m_{u+1}, \eta_u : u \le t)$.
(Note that effectively, $\mathcal{F}'_t$ can be imagined to have all the information including the arms played
until time $t+1$, except for the reward of the arm played at time $t+1$.) By definition of $\mathcal{F}'_t$, $m_t$ is
$\mathcal{F}'_{t-1}$-measurable, and $\eta_t$ is $\mathcal{F}'_t$-measurable. And, $\eta_t$ is a martingale difference process:

$$E[\eta_t \mid \mathcal{F}'_{t-1}] = E[r_{i(t)}(t) \mid b_{i(t)}(t), i(t)] - b_{i(t)}(t)^T\mu = 0.$$

Also, this makes

$$M_t = I_d + \sum_{u=1}^{t} m_u m_u^T = I_d + \sum_{u=1}^{t} b_{i(u)}(u)\,b_{i(u)}(u)^T,$$

$$\xi_t = \sum_{u=1}^{t} m_u\eta_u = \sum_{u=1}^{t} b_{i(u)}(u)\big(r_{i(u)}(u) - b_{i(u)}(u)^T\mu\big).$$

Note that $B(t) = M_{t-1}$, and $\hat\mu(t) - \mu = M_{t-1}^{-1}(\xi_{t-1} - \mu)$, so that

$$|b_i(t)^T\hat\mu(t) - b_i(t)^T\mu| = |b_i(t)^T M_{t-1}^{-1}(\xi_{t-1} - \mu)| \le \|b_i(t)\|_{M_{t-1}^{-1}}\,\|\xi_{t-1} - \mu\|_{M_{t-1}^{-1}} = \|b_i(t)\|_{B(t)^{-1}}\,\|\xi_{t-1} - \mu\|_{M_{t-1}^{-1}}.$$

The inequality holds because $M_{t-1}^{-1}$ is a positive definite matrix. Using the above lemma, for any
$\delta' > 0$, $t \ge 1$, with probability at least $1 - \delta'$,

$$\|\xi_{t-1}\|_{M_{t-1}^{-1}} \le R\sqrt{d\ln\left(\frac{t}{\delta'}\right)}.$$
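Therefore, $\|\xi_{t-1} - \mu\|_{M_{t-1}^{-1}} \le R\sqrt{d\ln\left(\frac{t}{\delta'}\right)} + \|\mu\|_{M_{t-1}^{-1}} \le R\sqrt{d\ln\left(\frac{t}{\delta'}\right)} + 1$. Substituting $\delta' = \frac{\delta}{T^2}$, we
get that with probability $1 - \frac{\delta}{T^2}$, for all $i$,

$$|b_i(t)^T\hat\mu(t) - b_i(t)^T\mu| \le s_{t,i}\cdot\left(R\sqrt{d\ln\left(\frac{t}{\delta'}\right)} + 1\right) \le \ell(T)\,s_{t,i}.$$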

This proves the bound on the probability of $E^\mu(t)$. To prove the bound on the probability of $E^\theta(t)$, we
use the fact that $\theta_i(t)$ is distributed as $\mathcal{N}(b_i(t)^T\hat\mu(t),\, v^2 b_i(t)^T B(t)^{-1} b_i(t))$. Therefore, using the
concentration inequalities for Gaussian random variables,

$$\Pr\left(|\theta_i(t) - b_i(t)^T\hat\mu(t)| > z\,v\sqrt{b_i(t)^T B(t)^{-1} b_i(t)}\;\Big|\;\mathcal{F}_{t-1}\right) \le \frac{1}{\sqrt{\pi}\,z}\,e^{-z^2/2}.$$

Substituting $z = \sqrt{4\ln(NT)}$, we get the desired bound.
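
The algebraic identity $\hat\mu(t) - \mu = M_{t-1}^{-1}(\xi_{t-1} - \mu)$ and the weighted Cauchy-Schwarz step used in the proof above can be checked on simulated data. The sketch below is illustrative only: the bounded uniform noise, the dimensions and all names are assumptions made for the example, not part of the paper.

```python
import numpy as np

def weighted_norm(x, A_inv):
    """Return ||x||_{A^{-1}} = sqrt(x^T A^{-1} x)."""
    return float(np.sqrt(x @ A_inv @ x))

rng = np.random.default_rng(1)
d, steps = 4, 50  # illustrative sizes (assumptions)

mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)  # true parameter with ||mu||_2 = 1

M = np.eye(d)      # M_t  = I_d + sum_u m_u m_u^T
xi = np.zeros(d)   # xi_t = sum_u m_u eta_u
f = np.zeros(d)    # sum_u b_u r_u, used to form mu_hat

for _ in range(steps):
    b = rng.normal(size=d)
    b /= np.linalg.norm(b)          # context with unit norm
    eta = rng.uniform(-0.5, 0.5)    # bounded zero-mean noise (assumption)
    reward = b @ mu + eta
    M += np.outer(b, b)
    xi += b * eta
    f += b * reward

M_inv = np.linalg.inv(M)
mu_hat = M_inv @ f

# Identity: mu_hat - mu = M^{-1} (xi - mu).
identity_gap = float(np.max(np.abs((mu_hat - mu) - M_inv @ (xi - mu))))

# Cauchy-Schwarz in the M^{-1} inner product for a fresh context b_new.
b_new = rng.normal(size=d)
b_new /= np.linalg.norm(b_new)
lhs = abs(b_new @ mu_hat - b_new @ mu)
rhs = weighted_norm(b_new, M_inv) * weighted_norm(xi - mu, M_inv)

# ||mu||_{M^{-1}} <= ||mu||_2 because every eigenvalue of M is at least 1.
mu_weighted = weighted_norm(mu, M_inv)
print(identity_gap, lhs, rhs, mu_weighted)
```

The last quantity illustrates why $\|\mu\|_{M_{t-1}^{-1}}$ can be bounded by 1 when $\|\mu\|_2 \le 1$.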
B.3 Bound on the sum of $s_{t,i(t)}$

We will use the following result, implied by the referred lemma in [3].

Lemma 6. [[3], Lemma 11]. Let $A' = A + xx^T$, where $x \in \mathbb{R}^d$, $A, A' \in \mathbb{R}^{d\times d}$, and all the eigenvalues
$\lambda_j$, $j = 1, \ldots, d$ of $A$ are greater than or equal to 1. Then, the eigenvalues $\lambda'_j$, $j = 1, \ldots, d$ of $A'$ can
be arranged so that $\lambda_j \le \lambda'_j$ for all $j$, and

$$x^T A^{-1} x \le 10\sum_{j=1}^{d}\frac{\lambda'_j - \lambda_j}{\lambda_j}.$$

Let $\lambda_{j,t}$ denote the eigenvalues of $B(t)$. Note that $B(t+1) = B(t) + b_{i(t)}(t)\,b_{i(t)}(t)^T$, and
$\lambda_{j,t} \ge 1, \forall j$. Therefore, the above implies

$$s_{t,i(t)}^2 \le 10\sum_{j=1}^{d}\frac{\lambda_{j,t+1} - \lambda_{j,t}}{\lambda_{j,t}}.$$

This allows us to derive the following along the lines of Lemma 3 of [9].

$$\sum_{t=1}^{T} s_{t,i(t)} \le 5\sqrt{dT\ln T}.$$
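
As an informal sanity check of the bound above, the sum $\sum_t s_{t,i(t)}$ can be computed for a simulated sequence of played contexts. This is a sketch under assumptions chosen for illustration: contexts are random unit vectors, and `sum_of_s`, `T` and `d` are hypothetical names and values. A single simulated sequence does not prove the bound; it only shows the quantities involved.

```python
import math
import numpy as np

def sum_of_s(contexts):
    """Accumulate s_t = sqrt(b_t^T B(t)^{-1} b_t), with B(t+1) = B(t) + b_t b_t^T."""
    d = contexts.shape[1]
    B = np.eye(d)
    total = 0.0
    for b in contexts:
        total += math.sqrt(b @ np.linalg.solve(B, b))
        B += np.outer(b, b)
    return total

rng = np.random.default_rng(2)
T, d = 2000, 5  # illustrative horizon and dimension (assumptions)

# Played contexts b_{i(t)}(t): random directions scaled to unit norm.
contexts = rng.normal(size=(T, d))
contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)

total = sum_of_s(contexts)
bound = 5 * math.sqrt(d * T * math.log(T))
print(total, bound)
```

For well-spread contexts such as these, the observed sum is typically far below the bound, since each $s_{t,i(t)}$ shrinks roughly like $\sqrt{d/t}$ as $B(t)$ grows in every direction.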



C Proof of Lemma 3

Given event $E^\mu(t)$, $|b_{i^*(t)}(t)^T\hat\mu(t) - b_{i^*(t)}(t)^T\mu| \le \ell(T)\,s_{t,i^*(t)}$. And, since the Gaussian random variable
$\theta_{i^*(t)}(t)$ has mean $b_{i^*(t)}(t)^T\hat\mu(t)$ and standard deviation $v\,s_{t,i^*(t)}$, using the anti-concentration inequality
in Lemma 4,



t-1 



1 



> — =e~ z '" 



2^ 
where



b it{t) (t) T fi-b if{t) (t) T fl(t) 



VS t ,i*( t ) 



< 



< 



< 



VS t ,i*(t) 



-(InT+l) 



Pr {6 iHt) (t) > h, {t) {t) T ^ | JU) > ^ e_f 



(InT+l) 



1 



D N different parameters: Proof of Theorem 1

Theorem 1 considers the setting where each arm $i$ is associated with a parameter $\mu_i$, where possibly
$\mu_i \ne \mu_{i'}$ for two different arms $i$ and $i'$. In this case, Thompson Sampling would maintain a separate
estimate of the mean $\hat\mu_i(t)$, and a separate $B_i(t)$, for each arm $i$, which would be updated only at the time
instances when $i$ is played.

$$B_i(t) = I_d + \sum_{u=1:\,i(u)=i}^{t-1} b_i(u)\,b_i(u)^T, \qquad \hat\mu_i(t) = B_i(t)^{-1}\left(\sum_{u=1:\,i(u)=i}^{t-1} b_i(u)\,r_i(u)\right).$$

The posterior distribution for each arm $i$ at time $t$ would be $\mathcal{N}(b_i(t)^T\hat\mu_i(t),\, v^2 b_i(t)^T B_i(t)^{-1} b_i(t))$.

Algorithm 2: Thompson Sampling for Contextual bandits with N parameters
Set $B_i = I_d$, $\hat\mu_i = 0_d$, $f_i = 0_d$, for $i = 1, \ldots, N$.
foreach t = 1, 2, . . . , do
    For each arm $i = 1, \ldots, N$, sample $\theta_i(t)$ independently from distribution
    $\mathcal{N}(b_i(t)^T\hat\mu_i,\, v^2 b_i(t)^T B_i^{-1} b_i(t))$.
    Play arm $i(t) := \arg\max_i \theta_i(t)$, and observe reward $r_t$.
    Update $B_{i(t)} = B_{i(t)} + b_{i(t)}(t)\,b_{i(t)}(t)^T$, $f_{i(t)} = f_{i(t)} + b_{i(t)}(t)\,r_t$, $\hat\mu_{i(t)} = B_{i(t)}^{-1} f_{i(t)}$.
end
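
The steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation. The class name, the simulated environment, the noise level and the value of `v` are all assumptions made for the example; in the paper `v` is set by the analysis.

```python
import numpy as np

class PerArmThompsonSampling:
    """Thompson Sampling with separate statistics (B_i, f_i, mu_hat_i) per arm."""

    def __init__(self, n_arms, dim, v, seed=0):
        self.v = v
        self.rng = np.random.default_rng(seed)
        self.B = np.stack([np.eye(dim) for _ in range(n_arms)])  # B_i = I_d
        self.f = np.zeros((n_arms, dim))                         # f_i = 0_d
        self.mu_hat = np.zeros((n_arms, dim))                    # mu_hat_i = 0_d

    def select_arm(self, contexts):
        """contexts has shape (n_arms, dim); row i is b_i(t)."""
        theta = np.empty(len(contexts))
        for i, b in enumerate(contexts):
            mean = b @ self.mu_hat[i]
            var = self.v**2 * (b @ np.linalg.solve(self.B[i], b))
            theta[i] = self.rng.normal(mean, np.sqrt(var))
        return int(np.argmax(theta))

    def update(self, arm, b, reward):
        """Update only the statistics of the arm that was played."""
        self.B[arm] += np.outer(b, b)
        self.f[arm] += b * reward
        self.mu_hat[arm] = np.linalg.solve(self.B[arm], self.f[arm])

# Hypothetical environment: linear rewards with small Gaussian noise.
env_rng = np.random.default_rng(42)
n_arms, dim, T = 3, 4, 200
true_mu = env_rng.normal(size=(n_arms, dim))
true_mu /= np.linalg.norm(true_mu, axis=1, keepdims=True)

agent = PerArmThompsonSampling(n_arms, dim, v=0.5)
plays = np.zeros(n_arms, dtype=int)
for t in range(T):
    contexts = env_rng.normal(size=(n_arms, dim))
    contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
    arm = agent.select_arm(contexts)
    reward = contexts[arm] @ true_mu[arm] + 0.1 * env_rng.normal()
    agent.update(arm, contexts[arm], reward)
    plays[arm] += 1
print(plays)
```

Storing $f_i$ alongside $B_i$ means each update needs only one linear solve for the played arm. The statistics of the other arms are left untouched, which matches the statement that each arm is updated only at the time instances when it is played.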

In the regret analysis, the events $E^\mu(t)$ will now be defined with respect to concentration of all
$\hat\mu_i(t)$ around their respective means. That is,

$$E^\mu(t) : \forall i,\; b_i(t)^T\hat\mu_i(t) \in \big[b_i(t)^T\mu_i - \ell(T)\,s_{t,i},\; b_i(t)^T\mu_i + \ell(T)\,s_{t,i}\big].$$

Similarly, $E^\theta_i(t)$ will be the event that

$$\theta_i(t) \in \big[b_i(t)^T\hat\mu_i(t) - \sqrt{4\ln(NT)}\,v\,s_{t,i},\; b_i(t)^T\hat\mu_i(t) + \sqrt{4\ln(NT)}\,v\,s_{t,i}\big],$$

and $E^\theta(t)$ will be the event that $\forall i$, $E^\theta_i(t)$ holds. It is easy to observe that the statements of Lemma
1 and the super-martingale property established by Lemma 2 will hold as they are for these new
definitions. The only difference will appear in the bound for $\sum_t s_{t,i(t)}$ used in the proof of Theorem
2. For the case of $N$ different parameters, we will get a bound of $O(\sqrt{NTd\ln T})$ on this quantity.

Let $n_i(T)$ be the number of times arm $i$ is played by time $T$. Then, using Lemma 6, for two
consecutive time steps $t, t'$ at which arm $i$ is played,



d 



Aj ) t ' - Xj, t 



This allows us to derive the following lemma along the lines of Lemma 3 of [9].
Lemma 7. [[9], Lemma 3] For $i = 1, \ldots, N$,

T 

t=l:i(t)=i 

Using above lemma, 

T NT N i 



t=l i=l t=l:i{t)=i i=l V * 

Therefore, following the same lines as proof of Theorem[2j we will get a reg ret bound of 0{d x f^V NT In N In T In i 